Predictive Reliability and Fault Management in Exascale Systems



Related resources

Introspective Fault Tolerance for Exascale Systems

Faults and errors are an unavoidable aspect of high performance computing systems. Emerging exascale systems will contain billions of hardware components and complex software stacks. In addition, higher fabrication density and power challenges will further compound fault detection, management and recovery. Efficient fault tolerance and resiliency frameworks are thus of immense importance in the...

Exploring reliability of exascale systems through simulations

Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to effectively and efficiently maintain the system reliability. Checkpointing is the state-of-the-art technique for high-end computing system reliability that has proved to work well for current pet...

Total order broadcast for fault tolerant exascale systems

In the process of designing a new fault tolerant run-time for future exascale systems, we discovered that a total order broadcast would be necessary. That is, nodes of a supercomputer should be able to broadcast messages to other nodes even in the face of failures. All messages should be seen in the same order at all nodes. While this is a well studied problem in distributed systems, few resear...

Restoring Reliability in Fault Tolerant Reconfigurable Systems

The new generations of SRAM-based FPGA devices, built on nanometer technology, are the preferred choice for the implementation of reconfigurable computing platforms. However, smaller technological scales increase their vulnerability to manufacturing imperfections and hence to the occurrence of electromigration. Moreover, the large internal RAM (for configuration purposes or as embedded memory b...

Power Management for Exascale

Most performance studies of large-scale HPC systems and their workloads have focused primarily on flops, bandwidth, and latency. Few concrete studies exist that focus on quantifying power and energy consumption at the hardware and software levels. Until recently, system vendors have had little incentive to expose extensive system and component-level power interfaces to users. Consequently, the ...

Journal

Journal title: ACM Computing Surveys

Year: 2020

ISSN: 0360-0300,1557-7341

DOI: 10.1145/3403956